Abstract



Full exploitation of instruction-level parallelism by superscalar and similar 
architectures requires speculative execution, in which we are willing to issue a 
potential future instruction early even though an intervening branch may 
send us in another direction entirely. Speculative execution can be based either
on branch prediction, where we explore the most likely path away from
the branch, or on branch fan-out, in which we explore both paths and 
sacrifice some hardware parallelism for the sake of not being entirely wrong. 
Recent techniques for branch prediction have greatly improved its potential 
success rate; we measure the effect this improvement has on parallelism. We 
also measure the effect of fan-out, alone and also in combination with a 
predictor. Finally, we consider the effect of fallible instructions, those that 
might lead to spurious program failure if we execute them speculatively; 
simply refusing to do so can drastically reduce the parallelism. 



1 Introduction 



Recent years have seen a great deal of interest in multiple-issue machines [1, 6, 9], machines that
can issue several mutually independent instructions in the same cycle. These machines exploit 
the parallelism that programs exhibit at the instruction level. 

It is important to know how much parallelism is available in typical applications. Machines 
providing a high degree of multiple-issue would be of little use if applications did not display 
that much parallelism. The available parallelism depends strongly on how hard we are willing to 
work to find it. Recent studies [4, 5, 6, 13, 14, 15, 16, 17] have led to a growing consensus
that high levels of parallelism are available only by doing speculative execution, in which we can 
issue an instruction whose data dependencies are satisfied even though its control dependencies 
are not. That is, we issue a potential future instruction early even though an intervening branch 
may send us in another direction entirely.

There are two approaches to the selection of instructions to execute speculatively. We can
do branch prediction, trying to guess whether a conditional branch will be taken so we know 
which of the two possible paths to follow in selecting instructions. Or we can fan out and select 
instructions from both possible paths, spending some of our machine parallelism for the assurance 
that at least some of the instructions we speculatively execute will be useful. It is possible to use 
a combination of the two, fanning out part of the time and predicting the rest of the time. 

This paper presents results concerning three questions. First, recent work in branch prediction
[8, 10, 18, 19] has shown how to use very large predictors to improve the performance of
hardware predictors from around 92% success to around 98%. What effect does this have on 
instruction-level parallelism? Second, how useful is a fan-out capability, both by itself and in 
combination with a predictor? Third, on some architectures certain instructions must not be executed
speculatively because they can cause run-time exceptions. Does this cripple a multiple-issue
machine, or can it be tolerated? 

We start with an overview of branch prediction and fan-out techniques. We then describe our 
experimental environment, based like many others on trace-based simulation. Finally we present 
our results, and some conclusions. 

2 Branch prediction 

Branch prediction can be done statically or dynamically. Static prediction based on the direction 
of the branch or other heuristics is only somewhat effective, but prediction based on a profile of 
a previous run of the application is successful around 90% of the time. Dynamic prediction is 
normally done in hardware, with the prediction for a given branch based on recent events in the 
execution. A common hardware branch predictor [7, 12] maintains a table of saturating two-bit 
counters. Low-order bits of a branch's address provide an index into this table, associating a 
counter with each branch; if the table is small then the program space wraps around, possibly 
associating the same counter to several branches across the program. We predict that a branch 
will be taken if the associated counter is 2 or 3, and otherwise predict not taken. Later, when 
the branch is resolved, we increment the counter if it was taken, and otherwise decrement it. A 
predictor of 512 counters is successful about as often as a profile, but unfortunately increasing 
the size of the table does not help much; the success rate levels off at 92% or 93% regardless of 
the table size. 
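The counter scheme just described can be sketched in a few lines of Python. This is only an illustration, not the hardware measured in this paper: the class name, the initial counter value of 1, and the assumption of 4-byte instructions (so the two low address bits are dropped before indexing) are choices made for the example.

```python
class TwoBitCounterPredictor:
    """Table of saturating two-bit counters indexed by low-order address bits."""

    def __init__(self, entries=512, initial=1):
        # entries is assumed to be a power of two so a mask picks the low bits
        assert entries > 0 and entries & (entries - 1) == 0
        self.mask = entries - 1
        self.counters = [initial] * entries

    def _index(self, address):
        # Assumption: 4-byte instructions, so the two low bits carry no information
        return (address >> 2) & self.mask

    def predict(self, address):
        # Counter values 2 and 3 mean "predict taken"
        return self.counters[self._index(address)] >= 2

    def update(self, address, taken):
        # Increment on a taken branch, decrement otherwise, saturating at 0 and 3
        i = self._index(address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

With a small table, two branches whose addresses agree in their low-order bits share one counter, which is the wrap-around effect mentioned above.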

Recent studies have explored more sophisticated hardware prediction using branch histories
[10, 18, 19]. These approaches maintain tables relating the recent history of the branch (or of
branches in the program as a whole) to the likely next outcome of the branch. These approaches 
do quite poorly with small tables, but unlike the two-bit counter schemes they can benefit from 
much larger predictors. 

An example is the local-history predictor [18]. It maintains a table of n-bit shift registers, 
indexed by the branch address as above. When the branch is taken, a 1 is shifted into the table 
entry for that branch; otherwise a 0 is shifted in. To predict a branch, we take its n-bit history 
and use it as an index into a table of 2^n 2-bit counters like those in the simple counter scheme
described above. If the counter is 2 or 3, we predict taken; otherwise we predict not taken. When the
branch is resolved, we increment the counter if it was taken; otherwise we decrement it. The local-history
predictor works well on branches that display a regular pattern with a small period. 
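A minimal Python sketch of a local-history predictor follows. The table sizes, the initial counter value, and the 4-byte instruction assumption are illustrative. The sketch also assumes that the pattern counters are stepped toward the actual branch outcome (up when taken, down when not), as in the simple counter scheme.

```python
class LocalHistoryPredictor:
    """Per-branch n-bit histories that index a shared table of 2-bit counters."""

    def __init__(self, history_entries=1024, history_bits=8):
        assert history_entries & (history_entries - 1) == 0
        self.addr_mask = history_entries - 1
        self.hist_mask = (1 << history_bits) - 1
        self.histories = [0] * history_entries      # one shift register per slot
        self.counters = [1] * (1 << history_bits)   # 2^n two-bit counters

    def _slot(self, address):
        # Assumption: 4-byte instructions
        return (address >> 2) & self.addr_mask

    def predict(self, address):
        history = self.histories[self._slot(address)]
        return self.counters[history] >= 2

    def update(self, address, taken):
        slot = self._slot(address)
        history = self.histories[slot]
        c = self.counters[history]
        # Train the counter selected by the old history on the real outcome
        self.counters[history] = min(3, c + 1) if taken else max(0, c - 1)
        # Shift the outcome into this branch's history register
        self.histories[slot] = ((history << 1) | int(taken)) & self.hist_mask
```

On a branch that repeats taken, taken, not taken, each history value is always followed by the same outcome, so after a short warm-up this sketch predicts every instance correctly.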

Sometimes the behavior of one branch is correlated with the behavior of another. A global-history
predictor [18] tries to exploit this effect. It replaces the table of shift registers with a single
shift register that records the outcome of the n most recently executed branches, and uses this 
history pattern as before, to index a table of counters. This allows it to exploit correlations in the 
behaviors of nearby branches, and allows the history to be longer for a given total predictor size. 

An interesting variation is the gshare predictor [8], which uses the identity of the branch as 
well as the recent global history. Instead of indexing the array of counters with just the global 
history register, the gshare predictor computes the xor of the global history and branch address. 
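The gshare indexing step can be sketched as follows. The parameter names, the initial counter value, and the 4-byte instruction assumption are illustrative, and the counters are assumed to be trained on the branch outcome as in the simple counter scheme. Dropping the xor with the address from `_index` would give the plain global-history predictor described earlier.

```python
class GsharePredictor:
    """One global history register, xor-ed with the branch address to pick a counter."""

    def __init__(self, index_bits=12):
        self.mask = (1 << index_bits) - 1
        self.history = 0                          # outcomes of the most recent branches
        self.counters = [1] * (1 << index_bits)   # two-bit saturating counters

    def _index(self, address):
        # Assumption: 4-byte instructions, so shift out the two low address bits
        return ((address >> 2) ^ self.history) & self.mask

    def predict(self, address):
        return self.counters[self._index(address)] >= 2

    def update(self, address, taken):
        i = self._index(address)
        c = self.counters[i]
        self.counters[i] = min(3, c + 1) if taken else max(0, c - 1)
        # Shift the outcome into the single global history register
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

Because the address takes part in the index, two branches that see the same global history can still use different counters.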

McFarling [8] got even better results by using a table of two-bit counters to dynamically choose 
between two different schemes running in competition. Each predictor makes its prediction as 
usual, and the branch address is used to select another 2-bit counter from a selector table; if the 
selector value is 2 or 3, the first prediction is used; otherwise the second is used. When the 
branch outcome is known, the selector is incremented or decremented if exactly one predictor 
was correct. This approach lets the two predictors compete for authority over a given branch, and 
awards the authority to the predictor that has recently been correct more often. McFarling found 
that combined predictors did not work as well as simpler schemes when the total predictor size 
was small, but did quite well indeed when large. 
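The selector mechanism can be sketched in Python as below. The class names, the selector's initial value of 2, and the trivial `FixedPredictor` used to exercise it are all hypothetical; any two objects with `predict` and `update` methods could stand in for the competing schemes.

```python
class FixedPredictor:
    """Stand-in component that always predicts the same direction (for illustration)."""

    def __init__(self, direction):
        self.direction = direction

    def predict(self, address):
        return self.direction

    def update(self, address, taken):
        pass  # nothing to learn


class CombinedPredictor:
    """Chooses between two predictors with a table of two-bit selector counters."""

    def __init__(self, first, second, selector_entries=1024):
        assert selector_entries & (selector_entries - 1) == 0
        self.first, self.second = first, second
        self.mask = selector_entries - 1
        self.selectors = [2] * selector_entries  # assumed start: weakly favor `first`

    def _index(self, address):
        return (address >> 2) & self.mask  # assumption: 4-byte instructions

    def predict(self, address):
        if self.selectors[self._index(address)] >= 2:
            return self.first.predict(address)
        return self.second.predict(address)

    def update(self, address, taken):
        i = self._index(address)
        first_ok = self.first.predict(address) == taken
        second_ok = self.second.predict(address) == taken
        # Move the selector only when exactly one component was correct
        if first_ok and not second_ok:
            self.selectors[i] = min(3, self.selectors[i] + 1)
        elif second_ok and not first_ok:
            self.selectors[i] = max(0, self.selectors[i] - 1)
        self.first.update(address, taken)
        self.second.update(address, taken)
```

For a branch that is never taken, pairing an always-taken component with an always-not-taken one hands authority to the second after a single misprediction.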

3 Branch fan-out

Rather than try to predict the destinations of branches, we might speculatively execute instructions 
along both possible paths, squashing the wrong path when we know which it is. Some of our 
hardware parallelism capability is guaranteed to be wasted, but we will never miss out completely 
by blindly taking the wrong path. Unfortunately, branches happen quite often in normal code, so 
for large degrees of parallelism we may encounter another branch before we have resolved the 
previous one. Thus we cannot continue to fan out indefinitely: we will eventually use up all the 
machine parallelism just exploring many parallel paths, of which only one is the right one. 

In some respects fan-out duplicates the benefits of branch prediction, but they can also work 
together. We explore both paths up to the fan-out limit, and then explore only the predicted path 
beyond that point. 

4 Fallible instructions 

In most architectures, some instructions can fail, causing an exception. Examples are memory 
references, which can cause segmentation violations, and floating-point operations, which can 
cause several kinds of traps. Speculatively executing a fallible instruction is dangerous, because it 
might make a correct program fail; to avoid this, the hardware must somehow make the exception 
itself speculative, so that the failure does not occur until we are sure that the instruction should 
have been executed. 

The easy way out is simply to refuse to speculatively execute a fallible instruction. This 
is likely to degrade the parallelism, since it will also delay safe instructions that depend on the 
fallible instruction, but it eliminates the need for hardware trickiness. 

5 Simulation environment 

To study the effects of these issues on instruction-level parallelism, we used the trace-based 
simulator described in detail in an earlier report [17]. An instruction trace of the application is 
passed, one instruction at a time, to the scheduler. The scheduler places each instruction into 
some cycle of a sequence of pending cycles, subject to dependencies with previously scheduled 
instructions. Whether there is a dependency is determined by the parallelism model we use. If 
the model does not include branch prediction, for example, then each instruction appearing after 
a branch in the trace must be scheduled after that branch in the pending cycles. If the model does
include branch prediction, in contrast, we can schedule later instructions into cycles before the
branch, if the predictor is successful; otherwise we must assume that a real machine would have
speculatively executed instructions from the wrong path, and would only start looking down the
correct path when execution of the branch instruction reveals the misprediction. 1

Figure 1: Fraction of branches predicted correctly by three different prediction
schemes, as a function of the total number of bits in the predictor

The simulator uses a greedy scheduling algorithm, placing each instruction as early as possible 
in the pending cycles, given the instructions that preceded it in the trace. Each cycle can hold a 
maximum of 64 instructions, and the entire sequence of pending cycles can hold 2048 instructions. 
When the number of pending instructions exceeds that number, we "issue" the first cycle, which 
prevents us from scheduling any more instructions in it. 
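The greedy placement and the window limit can be sketched as follows. This is a simplified illustration, not the simulator of [17]: the function name and trace format are hypothetical, every instruction is assumed to take one cycle, and dependencies are given explicitly instead of being derived from a parallelism model.

```python
from collections import deque


def greedy_schedule(trace, width=64, window=2048):
    """Place each instruction in the earliest pending cycle that can hold it.

    trace: iterable of (name, deps), where deps names earlier instructions.
    Returns a dict mapping each name to its cycle number.
    """
    cycle_of = {}
    pending = deque()  # instruction count of each cycle not yet issued
    base = 0           # absolute cycle number of pending[0]
    in_window = 0      # instructions currently held in pending cycles
    for name, deps in trace:
        # Assumption: unit latency, so a result is usable one cycle later
        earliest = max((cycle_of[d] + 1 for d in deps), default=base)
        c = max(earliest, base)  # issued cycles can no longer accept work
        while True:
            while c - base >= len(pending):
                pending.append(0)
            if pending[c - base] < width:
                break
            c += 1  # cycle is full, try the next one
        pending[c - base] += 1
        cycle_of[name] = c
        in_window += 1
        # Too many pending instructions: "issue" the first cycle(s)
        while in_window > window:
            in_window -= pending.popleft()
            base += 1
    return cycle_of
```

Under these assumptions the parallelism of a trace is the number of instructions divided by the number of cycles used.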

For the purposes of this paper, the parallelism model simulated is specified by four parameters: 
branch prediction and fan-out, fallibility, register renaming, and memory disambiguation. The 
full system is somewhat more flexible than this. 

In this paper we are interested in the effect of varying the size of the branch predictor. Different
predictors do the best in different regions of this spectrum of size. Figure 1 shows the harmonic
mean of the success rates of three predictors for twelve SPEC92 benchmarks, as the predictor
size varies. The two-bit-counter predictor does best for predictor sizes up to 512 bits. A predictor
built by combining a counter predictor and a gshare predictor does best in the middle range up
through 4K bits, and a combination of a local predictor and a gshare predictor works best above
4K bits. Throughout this paper, when we speak of a predictor of a particular size, the range that
includes this size will determine the prediction technique used.

1 This approach to a missed prediction ignores the possibility that code could be moved from after the point where
the paths rejoin to a position before the paths split apart. Recognizing such opportunities is difficult in hardware but
feasible in a software scheduler [4, 14].

Branch fan-out is a little trickier to model. We want to explore both paths away from a branch, 
but the simulator has only the correct instruction trace to work from and therefore cannot actually 
schedule instructions from paths not taken. Exploring these false paths on a real machine would 
use up hardware parallelism, however, especially since we will likely have to schedule another 
branch before the first is issued and resolved. We model this approximately by assuming that 
there is a fan-out limit on the number of branches we can look past. If our model has a non-zero
fan-out limit, we can handle branches beyond that limit either by giving up or by conventional 
branch prediction. 
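The fan-out limit can be expressed as a rule for finding the branch that an instruction may not be scheduled ahead of, following the simulator's rule given in footnote 2. The function name and the list-of-booleans input format are hypothetical.

```python
def barrier_branch(correct, fanout, use_prediction):
    """Return the index of the latest earlier branch an instruction must follow.

    correct: one boolean per earlier branch, in trace order, saying whether
             the predictor guessed that branch correctly.
    fanout:  the fan-out limit n.
    Returns None when the instruction may move ahead of every earlier branch.
    """
    if use_prediction:
        mispredicted = [i for i, ok in enumerate(correct) if not ok]
        if not mispredicted:
            return None  # every earlier branch was predicted correctly
        # The nth branch before the last incorrectly predicted branch
        idx = mispredicted[-1] - fanout
    else:
        # May pass the previous n branches, but not the (n + 1)st
        idx = len(correct) - 1 - fanout
    return idx if idx >= 0 else None
```

With a fan-out limit of zero and no prediction, the barrier is simply the most recent branch. With prediction and no fan-out, it is the most recent mispredicted branch.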

There are two kinds of instructions we can consider fallible: floating-point binary operations 
and memory-reference instructions. Among memory references, we arbitrarily allowed only heap references
to fail, on the perhaps generous assumption that program analysis or language semantics could 
preclude the failure of references to stack or static data. 

This paper is not directly concerned with the effects of register or memory dependencies. 
To provide a small selection of contexts for our exploration of branch analysis and fallibility, 
however, we assumed four different base models. The alpha model assumes perfect memory 
disambiguation, so that a store conflicts with a load or store only if the two actually reference 
the same word in memory, and assumes an infinite number of registers with a perfect renaming 
scheme, so that we never have output dependencies or antidependencies between registers. The 
beta model also assumes perfect memory disambiguation, but assumes 64 CPU registers and 64 
FPU registers, managed dynamically by a hardware renaming scheme using an LRU discipline 
(relative, of course, to the position in the scheduled cycles rather than in the instruction trace). The 
gamma model assumes perfect memory disambiguation and no register renaming, so that register 
conflicts are determined by the registers actually allocated by the DECstation compiler. The delta 
model assumes no register renaming and simple but very conservative memory disambiguation 
by instruction inspection, a common technique used in compile-time instruction-level pipeline 
schedulers: two instructions do not conflict if (a) they use the same base register but
different displacements, or (b) one uses a register known to point to the stack and the other uses one
known to point to the global data area. Figure 2 summarizes these four models.

Figure 2: The four base models of register renaming and memory disambiguation
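The delta model's disambiguation by inspection amounts to a small predicate over pairs of memory references. In the sketch below the `MemRef` type and the register names `sp` and `gp` are hypothetical, and the caller is assumed to ask only about pairs in which at least one reference is a store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemRef:
    base: str          # base register named in the instruction
    displacement: int  # constant offset from that register


# Hypothetical register names, chosen for the example
STACK_REGS = {"sp"}
GLOBAL_REGS = {"gp"}


def may_conflict(a, b):
    """Conservative test: True unless inspection proves the references distinct."""
    # (a) same base register but different displacements
    if a.base == b.base and a.displacement != b.displacement:
        return False
    # (b) one reference is stack-relative and the other is global-relative
    if (a.base in STACK_REGS and b.base in GLOBAL_REGS) or \
       (a.base in GLOBAL_REGS and b.base in STACK_REGS):
        return False
    # Anything else is assumed to conflict
    return True
```

The test is deliberately pessimistic: two references through different general-purpose registers are always treated as conflicting, which is what makes this model so conservative.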

In all four of these models we assume that all indirect jumps (chiefly procedure returns, calls 
to procedure variables, and case-statement indexed jumps) are predicted perfectly. Procedure 
returns are easy to predict with simple hardware, but other jumps are less so. Assuming perfect 
jump prediction is therefore generous but probably not consequential; indirect jumps are rare 
enough in the programs we tested that jump prediction has a significant effect on parallelism only 
when branch prediction is also perfect. 

All of our simulations were done with a set of twelve programs from the SPEC92 suite. (The 
rest of them run too long for our simulation to be feasible.) We usually gave them the official 
"small" data sets where possible, and in the case of tomcatv and alvinn we modified the value of 
a constant to reduce the number of iterations of the outer loop. 

6 Results 

Our first experiment measured the parallelism as the total size of the branch predictor increased. 
As described earlier, different predictors have better success rates in different size ranges, so we 
use different predictors for the small, intermediate, and large predictor sizes. (This is why some 
benchmarks show a sudden change around 512 bits or 4K bits.) Figure 3 shows the results for 
each of our four base models. The solid curves are integer benchmarks; the dotted curves are 
floating-point benchmarks. 

Figure 3: Effect of branch predictor size on parallelism

Under the alpha model, with infinite registers and perfect memory disambiguation, we see
that a few programs benefit considerably from large predictors. Gcc1 continues to improve even
as we reach a predictor of three-quarters of a megabit. Both li and espresso improve 40% between 
1 kilobit and 1 megabit; the harmonic mean of the improvements over that range is 25%. It is 
interesting that the programs least sensitive to the size of the predictor are those most parallel and 
those least parallel. 

Under the beta model we see roughly the same behavior, though it is not as pronounced. 
Now the mean payoff of the biggest predictor over the 1-kilobit predictor is about 14%. When 
we eliminate first register renaming and then perfect memory disambiguation, we see that the 
advantage of a very large predictor evaporates almost completely. The largest improvement from 
1 kilobit is 13%, but few programs do even close to this well; the mean is more like 2%.

Unsurprisingly, we see that a very large branch predictor can be helpful, but only if we get 
everything else right. 

Next we consider the effects of fanning out at branches. Figure 4 shows the mean parallelism 
over the 12 programs as the fan-out limit increases, for each of the four base models. The left-hand 
graph assumes that the fan-out capability is working alone, without subsequent branch prediction: 
when the fan-out limit is reached we can look beyond no more branches for instructions to issue. 






Speculative Execution and Instruction-Level Parallelism 



The right-hand graph assumes that fan-out is followed by branch prediction: when the fan-out 
limit is reached we can continue to look past branches for instructions, but only along the predicted path.
The predictor used is a modest one, a simple counter-based predictor of 256 entries. 2 

Without branch prediction, a little fan-out helps a lot, even in the poorer base models. Fanning 
out past just 1 level of branching improves the parallelism of gamma and delta by around 30%, 
and of alpha and beta by around 50%. Increasing the fan-out limit continues to improve things 
significantly, but the effects are not as dramatic. 

Interestingly, fanning out even to a level of 8 branches gives us a parallelism in each model 
that is nearly the same as the parallelism from using the half-kilobit predictor with no fan-out 
at all. Adding eight levels of fan-out to this predictor improves the parallelism somewhat, by 
30-45% in alpha and beta, and by about 7% in gamma and delta. 

Thus an ambitious fan-out capability could be an adequate substitute for branch prediction, 
though it is hard to imagine the circumstances in which it would be easier to implement. Adding
fan-out to even a modest predictor does not buy us much unless (again) we do a very
good job of handling register and memory dependencies.

We assumed that two kinds of instructions could fail: binary floating-point operations, and 
heap memory references. In the actual traces, of course, these operations never fail; since we could 
not know that in advance, we model their fallibility by insisting that they always be scheduled 
later than any previous branch. We also experimented with models in which only one of these two
classes of instructions is fallible. These proved uninteresting, because the behavior of the twelve
SPEC92 programs is bimodal: the integer programs do essentially no floating-point operations, 
and the floating-point programs make few or no heap references. In either case, assuming that 
only one could fail gave results essentially identical to assuming that neither or both could fail. 
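The way fallibility enters the scheduling constraint can be sketched as a small function. The name and arguments are hypothetical, and unit latency is assumed. `barrier` stands for the index of the branch that speculation cannot cross under the chosen prediction and fan-out model.

```python
def earliest_cycle(data_ready, branch_cycles, barrier, fallible):
    """First cycle in which an instruction may be scheduled.

    data_ready:    first cycle at which all operands are available.
    branch_cycles: cycles assigned to every earlier branch, in trace order.
    barrier:       index of the branch speculation cannot cross, or None.
    fallible:      True if the instruction may raise an exception.
    """
    if fallible:
        # Never speculate: wait for every earlier branch
        control = max(branch_cycles, default=-1)
    elif barrier is not None:
        # Wait only for branches up to and including the barrier
        control = max(branch_cycles[: barrier + 1], default=-1)
    else:
        control = -1  # free to move ahead of all earlier branches
    # Assumption: the instruction can issue the cycle after the branch resolves
    return max(data_ready, control + 1)
```

The gap between the two cases is the parallelism lost to fallibility. Safe instructions that depend on a fallible one inherit its delay through `data_ready`.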

Figure 5 shows the results, for the four base models and two different predictor sizes. We 
have separated out the integer from the floating-point programs, and present the harmonic mean 
parallelism for each. The upper curve in each pair is the parallelism without fallible instructions; 
the lower is with fallible instructions. 

Fallibility has a larger effect on the integer programs than on the floating-point programs. It
reduces the parallelism of integer programs so much that the most ambitious model has barely
half again the parallelism of the poorest.

Figure 5: Effect of fallible instructions on parallelism with 0.5-kilobit branch
predictor (left) and with 0.7-megabit branch predictor (right)

2 The simulator's viewpoint is the reverse of the hypothetical hardware's. The simulator schedules each successive
instruction into one of the pending cycles, so to implement a fan-out of n without prediction it allows the instruction
to be scheduled earlier than the previous n branches, but not the (n + 1)st. To implement fan-out followed by branch
prediction, we tentatively predict every branch, and allow an instruction to be scheduled before any number of
successfully predicted branches, preceded by n more branches whether predicted successfully or not. In other words,
the instruction must be scheduled after the nth branch before the last incorrectly predicted branch.

Fallibility has its greatest effect on the more ambitious models. It can cut the parallelism of 
a good model in half but rarely reduces the smaller parallelism of a poorer model by more than 
a fifth. Evidently (and perhaps obviously) the more different kinds of bottlenecks to scheduling 
you have, the less another one matters. 

7 Conclusions 

The qualitative conclusions of this study should come as no great surprise, though we hope the 
quantitative results will serve as useful hints to the architecture and compiler communities. 

Very good branch prediction from megabit history-based predictors can significantly improve 
parallelism, though the magnitude of this improvement was not as great as we had hoped to see. 
The payoff of a large predictor is probably negligible unless we also take strong action to reduce 
false register dependencies and disambiguate memory references. 

Fanning out across many levels of branches can in principle be a substitute for modest branch 
prediction, though a large predictor has no trouble beating it. Since fan-out is likely to be harder 
to implement than ordinary prediction, it is probably more interesting to note that adding fan-out
to prediction can improve it. As before, however, the improvement is significant only if we have 
false register and memory conflicts well under control. 

These results confirm that we really need to work with a combination of very good techniques 
if we want to achieve high levels of parallelism. It is therefore important to note that refusing 
to execute fallible instructions speculatively can halve the parallelism of the more ambitious 
models. Techniques that allow failures to be postponed until we are sure they were supposed to 
happen [2, 3, 11] are essential to the full exploitation of instruction-level parallelism.